Writing functions in R
StratoBayes workshop, 18th - 20th March 2024, Durham, UK
takes input -> does something -> returns output
A function needs a name, arguments in (), and a body in {}
subtract <- function(arg1, arg2) {
  arg1 - arg2
}
Imagine calculating the mean without standard functions like mean or sum:
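The original code isn't shown here; a minimal sketch of such a function (the name myMean is chosen for illustration) might be:

```r
# compute the mean with a loop instead of mean() or sum()
myMean <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total / length(x)
}

myMean(c(1, 2, 3, 4))  # 2.5, same as mean(c(1, 2, 3, 4))
```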
Arguments need to be provided in the correct order, or specified by name:
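Using the subtract function from above:

```r
subtract <- function(arg1, arg2) {
  arg1 - arg2
}
subtract(10, 4)               # positional: arg1 = 10, arg2 = 4
subtract(arg2 = 4, arg1 = 10) # named: order no longer matters
```

Both calls return 6.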
Functions make code more convenient to use and can hide complexities.
Additional, optional arguments can be allowed by using ... as the last argument:
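A small sketch (the function name greet is chosen for illustration); the extra arguments are forwarded to paste:

```r
# ... forwards any additional arguments to paste()
greet <- function(name, ...) {
  paste("Hello,", name, ...)
}
greet("Ada")
greet("Ada", "how are you?")
```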
A function generally should return something, but this does not:
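The original example isn't shown; one kind of function that returns nothing useful is one that only prints with cat (the name subtract_print is chosen here):

```r
# cat() prints to the console and returns NULL,
# so nothing useful comes back from this function
subtract_print <- function(arg1, arg2) {
  cat("result:", arg1 - arg2, "\n")
}
x <- subtract_print(5, 3)  # prints, but x is NULL
```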
Return explicitly with return, or place return value at the end of the function:
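Both styles side by side (function names chosen for illustration):

```r
# explicit return()
subtract_early <- function(arg1, arg2) {
  return(arg1 - arg2)
}
# implicit: the last evaluated expression is returned
subtract_last <- function(arg1, arg2) {
  arg1 - arg2
}
```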
This did not work as intended. R functions only return one object. Instead use lists or other data structures:
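A sketch of returning several results at once via a named list (summarise_vec is a name chosen here):

```r
# bundle several results into one named list
summarise_vec <- function(x) {
  list(mean = mean(x), sd = sd(x), n = length(x))
}
res <- summarise_vec(c(1, 2, 3))
res$mean  # 2
res$n     # 3
```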
Write a function that adds two values and creates a scatter plot with the first value on the x-axis and the result on the y-axis.
Hint: use the plot function to create a scatter plot with x being the first value, and y being the object we created with the addition.
Expand the function you just created to let it return the result after plotting
Hint: use the return function.
Expand the function further by allowing customisation of the plot by changing graphical parameters.
Hint: use ... as a function argument in both the outer and the inner function to allow additional arguments to be passed through.
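A sketch of one possible solution (the function name add_and_plot is chosen here):

```r
add_and_plot <- function(a, b, ...) {
  result <- a + b
  # scatter plot: first value on x, sum on y;
  # ... passes graphical parameters on to plot()
  plot(x = a, y = result, ...)
  result
}
add_and_plot(1:5, 2, col = "red", pch = 19)
```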
Custom binary operators – let’s define an operator for “not in”:
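For example (the operator name %notin% is a choice; any name between percent signs works):

```r
# a custom binary operator must be named with percent signs
`%notin%` <- function(x, table) {
  !(x %in% table)
}
3 %notin% c(1, 2)  # TRUE
2 %notin% c(1, 2)  # FALSE
```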
if a condition is true, do something.
else instructs what to do when the if condition is not met.
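A small example combining both (check_sign is a name chosen here):

```r
check_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}
check_sign(-3)  # "negative"
```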
Instead of many if and else statements, try switch
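An illustrative sketch (the mapping here is invented for the example; an unnamed last argument acts as the default):

```r
era_of_period <- function(period) {
  switch(period,
         Cambrian = "Paleozoic",
         Jurassic = "Mesozoic",
         Neogene  = "Cenozoic",
         "unknown")  # default when no name matches
}
era_of_period("Cambrian")  # "Paleozoic"
```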
Loops are used for repeating similar actions multiple times. for loops iterate over a set of values. The iterator (i) changes with every iteration of the loop:
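A minimal example:

```r
# the body runs once per value; i takes each value in turn
for (i in c(10, 20, 30)) {
  print(i)
}
```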
To generate sequences of integers, we can use seq_len. Let’s make a function:
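The original function isn't shown; a sketch (print_first_n is a name chosen here) might be:

```r
# print the integers 1 to n; seq_len(0) gives an empty
# sequence, so the loop is safely skipped for n = 0
print_first_n <- function(n) {
  for (i in seq_len(n)) {
    print(i)
  }
}
print_first_n(3)
```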
while loops repeat a task until a condition is no longer met.
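For example:

```r
# repeat until the condition (countdown > 0) is no longer TRUE
countdown <- 3
while (countdown > 0) {
  print(countdown)
  countdown <- countdown - 1
}
```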
Create a for loop that calculates the mean of all numerical columns of the Talat_isotopes dataset. If the dataset is not already loaded, we can read it with:
To check if a column is numeric, we can use is.numeric on the column. Using if and possibly else allows us to do different operations depending on the outcome of is.numeric.
Selecting a column can be done with square brackets, e.g. Talat_isotopes[ ,1] selects the first column.
To get the number of columns of Talat_isotopes, we can use the ncol function.
Use for (i in seq_len(ncol(Talat_isotopes))) { } to run one iteration per column. Everything inside the curly brackets is done exactly the same in each iteration, except that i changes. To print the output at every iteration of the loop, we can use the print function.
From the loop created in exercise 1, create a function that takes the Talat_isotopes dataframe (or any other dataframe) as input and returns the means of numerical columns as output.
Hint: remove NA entries from the output by indexing with !is.na.
numericColumnMean <- function(df) {
  # save number of columns
  nCol <- ncol(df)
  # create a vector for the output
  output <- rep(NA, nCol)
  # run loop for nCol iterations
  for (i in seq_len(nCol)) {
    if (is.numeric(df[, i])) {
      output[i] <- mean(df[, i])
    } else { # NA for non-numeric columns:
      output[i] <- NA
    }
  }
  # return output that is not NA
  output[!is.na(output)]
}
# apply function
numericColumnMean(Talat_isotopes)
[1]  -0.335762 -10.781942 411.571816 519.526138 205.526785
apply and related functions apply a function to elements of arrays, lists, … For example, to get the classes of the first three columns of the Talat_isotopes data:
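The call itself isn't reproduced above; with a small toy dataframe standing in for Talat_isotopes it might look like:

```r
# toy data standing in for Talat_isotopes
df <- data.frame(sample = c("M1", "M2"),
                 d13C = c(-2.1, -1.8),
                 d18O = c(-8.0, -9.5))
apply(df[, 1:3], 2, class)
```

Note that apply first converts the dataframe to a matrix, so with mixed column types every column is coerced (here to character); this coercion matters again in the exercise below.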
Let’s have a look what happened there. apply(X, MARGIN, FUN) takes an array X (our Talat_isotopes dataframe), and applies a function (FUN) to elements of that array, specified by MARGIN.
MARGIN = 2 indicates columns, MARGIN = 1 would be rows. So here we applied the class function to every column of Talat_isotopes.
apply simplifies the output, so here it returned a vector with one element for each column.
lapply is similar to apply but works on lists or vectors. It returns a list with one element per element of the input.
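For example:

```r
# returns a list, one element per input element
lapply(c(4, 9, 16), sqrt)
```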
vapply is a safer version of sapply (which is like lapply but simplifies its output): it requires the user to specify the anticipated class and length of the elements of the output:
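For example:

```r
x <- list(a = 1:3, b = 4:9)
# FUN.VALUE declares that each result must be a single integer
vapply(x, length, FUN.VALUE = integer(1))
```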
In exercise 2, we built a function that calculated the mean of all numerical columns of the Talat_isotopes dataset and returned it.
Can we achieve the same using apply?
This task is trickier than it looks, as apply coerces all columns to type character as soon as there is even one character column. We can add as.numeric to try to force columns into numeric. Better yet, we just select the columns containing values that can be coerced to numeric:
The second argument of apply, MARGIN, needs to be set to 2 for column-wise operations.
In the third argument, we can define a function, for example function(col) { } will modify each column according to the instructions in the curly brackets, referring to the column as col.
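A sketch of one possible solution, shown with toy data standing in for Talat_isotopes:

```r
# toy data standing in for Talat_isotopes
df <- data.frame(sample = c("a", "b"), d13C = c(-2, -1), d18O = c(-8, -9))
# select the numeric columns first, then take column means
is_num <- vapply(df, is.numeric, logical(1))
apply(df[, is_num], 2, function(col) mean(col))
```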
The Talat_isotopes and Talat_elements datasets are contained in a multi-sheet Excel file from the supplementary materials of Maloof et al. (2010). It is in the data folder and named Maloof_et_al_2010_SM.xls.
Reading such files with R can be a challenge and it is often easier to copy-paste the relevant data into clean spreadsheets and save them as .csv files.
However, here we try to automate the reading of sheets from this particular file, starting with the sheet Morocco-Talat n' Yissi.
We start by installing and loading the readxl package.
We notice that the Morocco-Talat n' Yissi sheet has two header rows, and the formatting is messy.
We read in the header rows of the sheet 4. Morocco-Talat n' Yissi:
# specify the file path
file_path <- "data/Maloof_et_al_2010_SM.xls"
# specify the sheet name
sheet_name <- "4. Morocco-Talat n' Yissi"
# read the top two rows
# trim_ws = FALSE keeps whitespace in cells
# col_names = FALSE stops cells being used as column names
header_rows <- read_excel(file_path, sheet = sheet_name, n_max = 2,
                          trim_ws = FALSE, col_names = FALSE)

Next, we merge the header rows as we want only a single column name per column:
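One way to do this is to paste the non-missing entries of each header column together:

```r
# paste the non-NA entries of each header column into one name
column_names <- apply(header_rows, 2, function(x) {
  paste(na.omit(x), collapse = "_")
})
```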
Finally, we use these column names for reading the entire data set, skipping the first two problematic rows:
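A sketch of that step (assigning the result to Talat_isotopes is an assumption here):

```r
# skip the two header rows; use the merged names instead
Talat_isotopes <- read_excel(file_path, sheet = sheet_name,
                             skip = 2, col_names = column_names)
head(Talat_isotopes)
```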
The result looks reasonable:
# A tibble: 6 × 4
`0_sample` d13C d18O d13Corg
<chr> <dbl> <dbl> <lgl>
1 M250-2.1 -2.07 -7.95 NA
2 M250-2.6 -1.78 -10.7 NA
3 M250-3.6 -2.04 -10.9 NA
4 M250-4.4 -2.34 -9.53 NA
5 M250-6.1 -2.04 -8.92 NA
6 M250-7.5 -1.59 -6.84 NA
Exercise: Can you build a function from the above and apply it to other sheets of the Maloof_et_al_2010_SM.xls file? We can also build a second function that uses this function to read multiple sheets at once.
We can almost copy-paste the lines we used to read the sheet earlier and collate them in a function.
We need two arguments: the file_path, and the sheet_index.
lapply is useful if we want to use that function on multiple sheets at once.
Here is a function that reads a single sheet:
read_sheets_Maloof <- function(file_path, sheet_index) {
  # read the top two rows
  header_rows <- read_excel(file_path, sheet = sheet_index, n_max = 2,
                            trim_ws = FALSE, col_names = FALSE)
  # merge to column names
  column_names <- apply(header_rows, 2, function(x) {
    paste(na.omit(x), collapse = "_")
  })
  # read the rest of the file
  output <- read_excel(file_path, sheet = sheet_index,
                       skip = 2, col_names = column_names)
  output
}

Using lapply, we can build a simple function that uses the previous function to read multiple sheets at once and saves them as a list of dataframes. Here, we use the indices of the sheets, rather than the names, so we don't have to type out the complicated sheet names.
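A sketch of such a wrapper (the function name is an assumption):

```r
# read several sheets by index into a list of dataframes
read_multiple_sheets_Maloof <- function(file_path, sheet_indices) {
  lapply(sheet_indices, function(i) {
    read_sheets_Maloof(file_path, sheet_index = i)
  })
}
# e.g. (hypothetical indices):
# sheet_list <- read_multiple_sheets_Maloof(file_path, c(4, 5))
```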
The resulting list of dataframes could then be combined into a single dataframe with merge.
An environment can be conceptualised as a place where objects, each with a name and a value, exist.
Each function call creates its own environment (for loops, by contrast, run in the surrounding environment).
If we run the following function to assign to b the value of a
and then look for b in the global environment
we get an error because b only existed within the function environment.
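A sketch of that example (the function name assign_b is chosen here):

```r
a <- 1
assign_b <- function() {
  b <- a  # b is created in the function's own environment
}
assign_b()
# b              # Error: object 'b' not found
exists("b")      # FALSE: b never existed in the global environment
```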
More on environments: adv-r.hadley.nz/environments.html
R uses scoping rules to look for variables (or functions). If a variable is not found in a function environment, R looks in the parent environment (i.e. the environment in which the function was created).
x is a free variable in the double_x function – it is not supplied to or defined in the function. Instead, it’s looked up in the environment where double_x was created, the global environment.
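For example:

```r
x <- 10
double_x <- function() {
  x * 2  # x is a free variable, found in the global environment
}
double_x()  # 20
```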
This can get tricky, see here for more details: bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html
Exercise: use vapply to return the three highest d13C values for every stage in Talat_isotopes. As a starting point, extract the unique stage names from the Talat_isotopes dataframe using the unique function:
We can use this as our first argument to vapply, and use subset to select only the values of d13C that are from a specific stage.
The sort function can be useful to sort the d13C values in decreasing order, for example:
Instead of passing Talat_isotopes as an argument, we can let the function look it up in the global environment, where it was defined.
The third argument of vapply needs to specify the class and length of the output. In our case, it will be three numeric values: numeric(3)
The first solution takes advantage of R functions being able to find variables that were defined outside the function environment:
# build a custom function for vapply
three_largest_stages <- function(s) {
  # subset to only include rows from the current stage
  # note: we haven't passed Talat_isotopes as a function argument
  Talat_isotopes <- subset(Talat_isotopes, stage == s)
  # extract three largest values
  # (three_largest was defined earlier, e.g. sort(x, decreasing = TRUE)[1:3])
  three_largest(Talat_isotopes$d13C)
}
# put everything together
vapply(unique_stages, FUN = three_largest_stages,
       FUN.VALUE = numeric(3))
        T    A    B
[1,] 0.74 3.46 1.79
[2,] 0.37 3.44 1.78
[3,] 0.34 3.43 1.76
vapply has returned the output as a matrix.
The function above works because Talat_isotopes is accessed from the global environment, but only modified within the function environment.
Caution: This can lead to confusion and it is often better to avoid this.
A cleaner solution would be to pass Talat_isotopes as a second argument to the three_largest_stages function.
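A sketch of that cleaner variant (here the three-largest step is written out with sort, as suggested in the hint above):

```r
# pass the dataframe explicitly instead of relying on scoping
three_largest_stages <- function(s, df) {
  df <- subset(df, stage == s)
  sort(df$d13C, decreasing = TRUE)[1:3]
}
# extra named arguments to vapply are passed on to the function:
# vapply(unique_stages, three_largest_stages,
#        FUN.VALUE = numeric(3), df = Talat_isotopes)
```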
The following error-handling tools are very helpful in complex functions.
Check that input is correct and display custom error messages:
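For example, with stop (the function name subtract_checked is chosen here):

```r
subtract_checked <- function(arg1, arg2) {
  if (!is.numeric(arg1) || !is.numeric(arg2)) {
    stop("both arguments must be numeric")
  }
  arg1 - arg2
}
subtract_checked(5, 2)     # 3
# subtract_checked(1, "a") # Error: both arguments must be numeric
```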
Use tryCatch if you anticipate an error but want the function to continue.
Let’s try to generate data from a multivariate normal distribution:
mvnfast::rmvn is fast but fails for some problematic sigma values. In case it fails, we use MASS::mvrnorm instead:
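A sketch of that fallback with tryCatch (the wrapper name rmvn_safe is chosen here; mvnfast and MASS must be installed):

```r
rmvn_safe <- function(n, mu, sigma) {
  tryCatch(
    mvnfast::rmvn(n, mu = mu, sigma = sigma),
    error = function(e) {
      message("rmvn failed, falling back to MASS::mvrnorm")
      MASS::mvrnorm(n, mu = mu, Sigma = sigma)
    }
  )
}
```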
If something went wrong, find out where using traceback():
Break points allow you to look inside your function’s environment. This works best when the function is in a separate script. Click next to a line of code in your function to activate a break point (a red dot appears):
You may need to save the script and source it (press source in the Rstudio toolbar)
We can now browse the function environment in the console like we normally can browse the global environment. For example we can look at sigma:
Press Stop to end the browsing. Don’t forget to deactivate the break point by clicking on the red dot in the script.
do.call() takes a function and a list of arguments, and applies the function to the arguments. This can be useful in a variety of situations:
To combine data:
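For example, rbind-ing a whole list of dataframes at once:

```r
dfs <- list(data.frame(x = 1:2), data.frame(x = 3:4))
do.call(rbind, dfs)  # same as rbind(dfs[[1]], dfs[[2]])
```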
With dynamic arguments:
For example, we want to allow the user to specify additional arguments for a custom plot function, including xlim. In case they don't specify it, we want to generate xlim values automatically:
my_plot <- function(x, ...) {
  args <- list(...) # save the additional arguments
  if ("xlim" %in% names(args)) {
    xlim <- args$xlim # if user provides xlim, use that
    args$xlim <- NULL # to avoid argument duplication
  } else {
    xlim <- range(x) + c(-1, 1) # automatic xlim
  }
  do.call(plot, c(list(x = x, xlim = xlim), args))
}

If you have large data sets and complex functions, you may want to enhance their performance.
The microbenchmark package performs an operation many times, and measures the average time it takes. You can also compare different operations.
The profvis package lets you identify bottlenecks in your code:
Only spend time trying to make your code faster if
Here is a good overview on making R functions run faster: Best coding practices in R
Pre-allocating memory is faster than growing objects repeatedly.
Assume we have recorded the results of 1,000 dice rolls:
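The data and the two functions were not reproduced here; a sketch consistent with the benchmark names is_six_1 and is_six_2 might be:

```r
set.seed(1)
data <- sample(1:6, 1000, replace = TRUE)

# version 1: grow the result vector in every iteration (slow)
is_six_1 <- function(rolls) {
  res <- c()
  for (i in seq_along(rolls)) {
    res <- c(res, rolls[i] == 6)
  }
  res
}

# version 2: pre-allocate the result vector
is_six_2 <- function(rolls) {
  res <- logical(length(rolls))
  for (i in seq_along(rolls)) {
    res[i] <- rolls[i] == 6
  }
  res
}
```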
Let’s check which approach is faster:
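The comparison can be run with microbenchmark (assuming the package is installed):

```r
microbenchmark::microbenchmark(
  is_six_1(data),
  is_six_2(data),
  times = 1000)
```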
Unit: nanoseconds
expr min lq mean median uq max neval
is_six_1(data) 1200 1301 3295.097 1301 1500 1857400 1000
is_six_2(data) 700 801 2599.012 901 1000 1650201 1000
is_six_2() is faster, as R doesn’t have to grow the res object in every iteration.
R has many functions that are vectorised.
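A vectorised version, consistent with the benchmark name is_six_3, needs no loop at all:

```r
# vectorised: == compares every element at once
is_six_3 <- function(rolls) {
  rolls == 6
}
```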
Unit: nanoseconds
expr min lq mean median uq max neval
is_six_2(data) 800 901 2908.908 901 1000 1939702 1000
is_six_3(data) 100 201 779.819 201 202 542801 1000
The vectorised version is_six_3() is much faster.
Vectorised matrix functions like rowSums(), colSums() or rowMeans() can lead to impressive speedups:
data <- matrix(rnorm(10^4), nrow = 100)
microbenchmark::microbenchmark(
apply(data, 2, sum),
colSums(data),
  times = 1000)

Unit: microseconds
expr min lq mean median uq max neval
apply(data, 2, sum) 123.700 143.301 211.38145 159.351 213.501 9480.201 1000
colSums(data) 5.701 6.301 8.79838 6.601 7.001 1694.601 1000
The Rcpp package allows for integrating C++ code with R, which can make R functions a lot faster. It requires the installation of a C++ compiler (Rtools on Windows, Xcode on Mac, possibly "sudo apt-get install r-base-dev" on Linux).
Read more at Rcpp: Seamless R and C++ Integration or Advanced R.
I also highly recommend ChatGPT for help with creating C++ functions.
As an example, let’s compare a random walk implemented in R with one implemented with Rcpp.
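The plain R version was not reproduced above; a sketch (the name random_walk_R is an assumption) could be:

```r
# cumulative sum of normal steps, written as an explicit loop
random_walk_R <- function(n_steps) {
  x <- numeric(n_steps)
  for (i in 2:n_steps) {  # assumes n_steps >= 2
    x[i] <- x[i - 1] + rnorm(1)
  }
  x
}
```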
Next, the Rcpp version:
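A sketch of the Rcpp version via cppFunction (requires Rcpp and a C++ compiler; the C++ body here is an illustration, not the workshop's original):

```r
Rcpp::cppFunction('
NumericVector random_walk_Rcpp(int n_steps) {
  NumericVector x(n_steps);
  x[0] = 0.0;
  for (int i = 1; i < n_steps; i++) {
    x[i] = x[i - 1] + R::rnorm(0.0, 1.0);
  }
  return x;
}')
```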
We now have the function random_walk_Rcpp in the global environment.
Let’s make sure both versions work:
Now let us compare the speed:
Loops are much faster in C++!
